10 research outputs found

    Molecular Distance Maps: An alignment-free computational tool for analyzing and visualizing DNA sequences' interrelationships

    In an attempt to identify and classify species based on genetic evidence, we propose a novel combination of methods to quantify and visualize the interrelationships between thousands of species. This is possible by using Chaos Game Representation (CGR) of DNA sequences to compute genomic signatures, which we then compare by computing pairwise distances. In the last step, the original DNA sequences are embedded in a high-dimensional space using Multi-Dimensional Scaling (MDS) before everything is projected onto a Euclidean 3D space. To start with, we apply this method to a mitochondrial DNA dataset from NCBI containing over 3,000 species. The analysis shows that the oligomer composition of full mtDNA sequences can be a source of taxonomic information, suggesting that this method could be used for unclassified species and taxonomic controversies. Next, we test the hypothesis that the CGR-based genomic signature is preserved along a species' genome by comparing inter- and intra-genomic signatures of nuclear DNA sequences from six different organisms, one from each kingdom of life. We also compare six different distances and assess their performance using statistical measures. Our results support the existence of a genomic signature for a species' genome at the kingdom level. In addition, we test whether CGR-based genomic signatures originating only from nuclear DNA can be used to distinguish between closely related species, and we answer in the negative. To overcome this limitation, we propose the concept of "composite signatures", which combine information from different types of DNA, and we show that they can effectively distinguish all closely related species under consideration. We also propose the concept of "assembled signatures" which, among other advantages, do not require a long contiguous DNA sequence but can be built from smaller ones consisting of ~100-300 base pairs.
    Finally, we design an interactive web tool, MoDMaps3D, for building three-dimensional Molecular Distance Maps. The user can explore an already existing map or build their own using NCBI's accession numbers as input. MoDMaps3D is platform independent, written in JavaScript, and can run in all major modern browsers.
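
The CGR step named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the corner assignment (A, C, G, T on the unit square), the starting point (0.5, 0.5), the function names `cgr_points` and `cgr_grid`, and the choice to skip ambiguous symbols are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed corner convention; the abstract does not specify one.
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Play the chaos game: move halfway toward the corner of each base."""
    x, y = 0.5, 0.5  # assumed starting point (centre of the unit square)
    points: List[Tuple[float, float]] = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ambiguous symbols such as N are skipped
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq: str, k: int) -> List[List[int]]:
    """Count CGR points on a 2^k x 2^k grid; each cell corresponds to one k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i + 1 < k:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_grid("ACGTACGTTTGA", 2):
        print(row)
```

The resulting count grid is one possible form of the "genomic signature" that later steps compare pairwise; the resolution `k` controls which oligomer length the grid cells correspond to.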

    An investigation into inter- and intragenomic variations of graphic genomic signatures

    We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences originating from the genomes of six organisms, each belonging to one of the kingdoms of life: H. sapiens, S. cerevisiae, A. thaliana, P. falciparum, E. coli, and P. furiosus. We also provide preliminary evidence of this method's applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over one hundred and fifty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps that visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Our analysis confirms that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our analysis of the performance of the assessed distances uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance (introduced here) are best able to classify genomic sequences.
    Comment: 14 pages, 6 figures, 5 tables
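
As a rough illustration of the pairwise-distance step, the sketch below computes only the Euclidean distance, which the abstract names as the baseline; it does not implement DSSIM or the descriptor distance. The function names and the normalisation of counts to relative frequencies are assumptions, and the tiny matrices stand in for real CGR count grids.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def normalise(grid: Matrix) -> List[float]:
    """Flatten a count grid and scale it to relative frequencies (assumed step)."""
    flat = [float(v) for row in grid for v in row]
    total = sum(flat)
    return [v / total for v in flat] if total else flat


def euclidean(a: Matrix, b: Matrix) -> float:
    """Euclidean distance between two grids of the same shape."""
    fa, fb = normalise(a), normalise(b)
    if len(fa) != len(fb):
        raise ValueError("grids must have the same number of cells")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))


def distance_matrix(grids: Sequence[Matrix]) -> List[List[float]]:
    """Symmetric matrix of pairwise distances, the input an MDS step would need."""
    n = len(grids)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclidean(grids[i], grids[j])
    return dist


if __name__ == "__main__":
    # Toy stand-ins for CGR count grids of three sequences.
    toy = [[[1, 1], [1, 1]], [[2, 2], [2, 2]], [[4, 0], [0, 0]]]
    for row in distance_matrix(toy):
        print([round(v, 4) for v in row])
```

Because the grids are normalised first, two sequences with the same oligomer proportions but different lengths come out at distance zero under this sketch; whether the study normalises in this way is not stated in the abstract.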

    Mapping the Space of Genomic Signatures

    We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence homology and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information.
    Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1307.375

    Additional file 1 of Additive methods for genomic signatures

    The 42 genomes analyzed in the Results section, and the two genomes exemplified in the Remarks subsection: Scientific name, number of chromosomes, NCBI accession number. (*P. patens genome from JGI Phytozome). (PDF 127 kb)

    CGR images for three DNA sequences.

    (a) Homo sapiens sapiens mtDNA, 16,569 bp; (b) Homo sapiens sapiens chromosome 11, beta-globin region, 73,308 bp; (c) Polypterus endlicherii (fish) mtDNA, 16,632 bp. Observe that chromosomal and mitochondrial DNA from the same species can display different patterns, and also that mtDNA of different species may display visually similar patterns that are however sufficiently different as to be computationally distinguishable.

    Molecular Distance Map of three classes: Amphibia, Insecta and Mammalia.

    The method successfully clusters taxonomic groups also at the Class level. Gaps and spaces in clusters, in this and other maps, may be due to sampling bias. A topic of further exploration would be to understand the cluster shapes and nature of the distribution of sequences in this figure. The total number of mtDNA sequences is 790, the average DSSIM distance is 0.8139, and the MDS Stress-1 is 0.16.

    Molecular Distance Map of all represented species from (super)kingdom Protista and its orders.

    The total number of mtDNA sequences is 70, the average DSSIM distance is 0.8288, and the MDS Stress-1 is 0.26. The sequence-point #1466 (red) is the unclassified Haemoproteus sp. jb1.JA27, #1935 (grey) is Babesia bovis T2Bo, and #3173 (grey) is Theileria parva. The annotation shows that all these three species belong to the same taxonomic groups, Chromalveolata, Alveolata, Apicomplexa, Aconoidasida, up to the order level.

    Molecular Distance Map of class Amphibia and three of its orders.

    The total number of mtDNA sequences is 112, the average DSSIM distance is 0.8445, and the MDS Stress-1 is 0.18. Note that the shape of the amphibian cluster and the (x, y)-coordinates of sequence-points are different here from those in Fig. 4 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0119815#pone.0119815.g004). This is because MDS outputs a map that aims to preserve pairwise distances between points, but not necessarily their absolute coordinates.
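
The point about MDS coordinates can be checked with a small Python sketch: rotating a set of points changes every coordinate but leaves all pairwise distances unchanged, so two maps that look different can fit the same distance data equally well. The points and the helper names `pairwise`, `rotate` and `max_abs_diff` are hypothetical and not taken from the figure.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pairwise(points: Sequence[Point]) -> List[List[float]]:
    """All pairwise Euclidean distances between 2D points."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate(points: Sequence[Point], angle: float) -> List[Point]:
    """Rotate every point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def max_abs_diff(a: List[List[float]], b: List[List[float]]) -> float:
    """Largest entry-wise difference between two distance matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


if __name__ == "__main__":
    original = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical map points
    turned = rotate(original, 1.0)
    print("coordinates changed:", turned != original)
    print("largest distance change:", max_abs_diff(pairwise(original), pairwise(turned)))
```

The same holds for reflections and translations, which is why only the relative positions of sequence-points on a Molecular Distance Map carry meaning.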